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1 Introduction 

The PS system deals primarily with specifications which describe arrays, and which are simple 
enough for a symbolic manipulation process whose output will be very straightforward procedural 
code. The language is described in Gokhale (1986a, 1986b), and the current implementation in 
Torgersen (1986). The methods reported here extend the system in order to generate simple classes 
of parallel programs by solving a series of (pure, linear) integer programming problems. The type of 
program generated is a “hyperplane” program: a single time-series of concurrent loop bodies, where 
the loop bodies contain only conditional assignment to array locations. In the input equations, the 
arrays axe defined via recurrences. 

Recurrence relations axe a common way of expressing problems in a variety of domains. The 
recurrence relation is often thought of as defining an iteration in which a new value is a function of 
certain previously computed values. A great deal of research has gone into parallelizing recurrence 
relations. We have used work by Allen (1983a, 1983b), Banerjee (1979), Cytron (1986), Kuck (1980, 
1981), Midkiff (1986), Polychronopoulos (1986), and Warren (1983). Lamport (1974) introduces 
hyperplane scheduling, in which the nested loop indices of an iterative program are viewed as 
defining a single multidimensional M\ x Af? . . . x M n array of events. For some common kinds of 
computation, we can schedule large numbers of events in parallel by selecting a time axis; at any 
time t, those events will be scheduled which lie on the hyperplane normal to the time axis at t. When 
this method is successful, the time required is reduced from the volume of the multidimensional 
array down to the length of some line through it, provided that enough processors are available. 
These notions will be carefully described below. 



There are many ways of treating equational specifications as programs, and issues of trans- 
formability and parallelism crop up frequently in the history of this style of program development. 
McCarthy (1960) showed how recursive equations could be directly interpreted; much recent work 
in algebraic manipulation and parallel interpretation of equational programs has centered around 
the suggestions of Backus (1978). A great deal of such research has emphasized generality and 
expressive power for equational languages, such O’Donnell (1985) in which (although a form of 
compilation is used) the design emphasis is on semantic completeness. The MODEL system of 
Prywes (1983), in contrast, emphasizes restriction towards a restricted class of equations for which 
simple, efficient programs can be generated, and the PS system of Gokhale (1986a) follows this 
approach with an additional goal of parallel execution. 

In this paper, we introduce a new method to obtain a hyperplane schedule, by manipulation 
of an equational description of the problem rather than indirectly from an iterative program. Our 
method can be used in the presence of certain forms of symbolic values in dependency vectors 
(defined below) derived from the recurrence. In addition, the method can be used on mutually 
recursive arrays. 

In this work, we begin with an equational specification. A specification does not demand a fixed 
execution order, but merely specifies data dependency constraints on the computation of each data 
point. Our goal is to schedule computation of as many points in parallel as possible. To make the 
problem more tractable, we will assuipe that the time at which a data point can be computed is a 
linear combination of its subscripts. This “hyperplane” assumption allows us to use simple integer 
programming techniques to solve a variety of subproblems. 

In the equational language, each subscript used to index the array being defined is assumed to 
belong to an index set, a contiguous range of cardinals. This corresponds to the Pascal concept of 
a subrange type. 

For example, we describe the combinatorial problem of counting all the partitions of a cardinal 
n as the specification 


part[n ] 


/M] 

0, if n < m 

1, if n — m 

f{n — m, m] + /[n, m + 1] otherwise 


In this case, part[n\ is the number of partitions of n and /[t,j] is the number of partitions of i 
which use no values less than j. The conditional equations axe structured conventionally, and we 
will not provide a formal definition for the language. We write a sequence of rules, like the example 
above, with each rule in the form <L item=value f if condition Each item is a scalar, usually an 
array component; each value is an expression whose value is to be that of the item, provided that 
the corresponding condition is True. We normally end with a default equation, labelled with the 
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trivial “otherwise” condition which is always assumed to be True . For all such sequences, we expect 
the rules to be tried in order, i.e. each rule presupposes the complement of the conditions of the 
preceding rules. 

We make a further assumption, less common with users of equational notations: we assume 
that n and m will be explicitly declared as belonging to index sets. In the problems which we will 
work with, this presents no difficulties; for example, if we want to find part[n\ for values of n in 
0 . . . k then it is acceptable to have m in the same set. There are problems for which the description 
of the index set is as hard as finding the answer to the problem; we will ignore these. 

The partition problem can be solved either in a goal-driven (top-down) mode which consists of 
a depth- first traversal of the search tree starting with goal node /(n, 1), or in a data-driven (bottom 
up) mode starting with /( 1, 1) (see Henzinger (1986)). In the following, we will use the data-driven 
model. The schedule our method generates from an equational specification will be expressed in 
pidgin Algol, augmented with a concurrent forall loop structure. 

t orall ij € 0 . . . N, k € -3 . . . 300 do S 

specifies 304(iV + l) 2 concurrent instantiations of S, one for each possible value of t,j, k. The 
instantiations may actually be simultaneous or in any sequential order. 

The most direct computational model for such a schedule is an n-dimensional MIMD processor 
array in which there is a processor for each point in the Cartesian product of the index sets. More 
generally, we can partition the Cartesian product into contiguous “blocks”’ (blocksize > 1) and 
have one processor for each block. 

A schedule can contain ordinary for loops, with assignment and reassignment to scalars and 
array components. However, our equational language does not have the notion of reassignment. 
Therefore, a recurrence over an n-dimensional array is described by an nf 1-dimensional array in 
the specification. Each new iteration describes a new element of the extra dimension. 

The next section gives some examples to motivate the work. Afterwards, we will show the 
method as applied to the examples. 

2 Sample Recurrences: Regular and 111- Structured 

The set of equations shown in Figure 1 is typical of many problems in numerical analysis in which 
points are affected only by neighboring points in a highly regular way, with the boundaries as sole 
exceptions to the basic rules. A succession of grids is computed (indicated by the k subscript) in 
which a point on a new grid is defined as the weighted average of its neighbors on the previous grid 
and the current grid. Values on the boundary are copied from the previous grid. Interior points 
take the value of the corresponding point on the previous grid modified by a small amount. 

This same conceptual structure (highly regular interior with exceptions only at the boundaries) 
can be seen in a simpler problem from a completely different area. Calculation of Pascal’s Triangle 
is specified in Figure 2. 
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A k [*, i] 


= 0 , if {*, j} n { 0 ,M+ 1 } ^ {} 

= ^k-i[iyj] + - 1] 

+Ak[i - l,j] + Ak-i[i,j + 1] + Ak- 1[* + l,j] 
-4 x A k -i[i,j] 

-h 2 x /(t, j)/4) otherwise 


Figure 1: A Relaxation Recurrence 


= 0, if 0 € {*, j} 

= P[i - l,j] + p\i,j - 1] otherwise 


Figure 2: Pascal’s Triangle 


The data dependency graph of Pascal’s Triangle is shown in Figure 3. As can be seen from the 
figure, the pattern of dependency is very regular. The arrows represent vectors of data dependency, 
or dependency vectors . If the dotted diagonal line represents a time line, all points on a normal 
to the time line can be computed at one time. The succession of lines (or in the general case 
hyperplanes) normal to the time line at discrete steps describe an iteration in time. At each step 
of the iteration, all points on the normal hyperplane can be computed concurrently. 

These two examples represent a class of problems which axe easy to solve in parallel; our first 
method will deal only with such problems. The next example shows a much less regular dependency 
pattern. The familiar greatest common divisor definition is shown in Figure 4. As cam be seen from 
Figure 5, dependency axes do move outwards from the origin, but their structure is not very regular. 
If there is a “hyperplane of parallelism,” it is not immediately discernable. (The ged specification 
can be written in several ways, with greater or lesser regularity. Some readers may wish to write 
down a few alternative definitions and follow our methods on them.) 

Our starting point here is that parallelism is often linear, and the lines of parallelism can be 
found mechanically. Lamport (1974) shows a mechanical method for finding the parallelism in some 
of these structures. That method depends on the conditions that 

1. each point has the same number of dependency vectors, and 

2. corresponding vectors for each point have the same length 
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Figure 3: Pascal Triangle Dependencies 


9[hj] 


j, if » = 0 
if j = 0 

ff[* - J\j] ^ * > 3 

y[*,j — i] otherwise 


Figure 4: The gcd Specification 
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Figure 5: gcd Dependencies . 

This condition holds for the Relaxation and Pascal’s Triangle dependencies, but does not hold for 
gcd . 

In the next section, we describe a method for finding hyperplanes when the dependency vectors 
do not necessarily have a constant length. We will lead in to this presentation with a variant on 
Lamport’s method, changed only as fair as necessary to adapt to a purely declarative context. 

3 Finding Hyperplanes: A Restricted Method and its Extension 

3.1 Linear Solutions for Array Schedules with Fixed Dependencies 

We will illustrate the restricted method with an informal example before describing the rules in 
detail. The fundamental restriction, which no amount of effort on our part will remove, is that no 
value should be created before the values on which it depends. Thus in Pascal’s Triangle, P[t,j] 
generally cannot be created until after P[t - 1, j] and P[t,j — 1] are available. 

1. Our approach is based on the idea of defining the time of creation for each array element as 
a linear combination of the indices, each multiplied by a symbolic coefficient. For Pascal’s 
Triangle, this gives us the time equation 

t(P[iJ}) = <*i + bj . 

Our first problem is to find usable values for the symbolic coefficients a and b. 
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2. We now represent the problem’s dependence ordering with (strict) inequalities involving time. 
In this case, the time for P[i, j] must come after the time for P[i — 1, j] and also after P{^J- 1]; 
that gives us two dependence inequalities , namely 


at + bj > at + b(j — 1) => b > 0 
at + bj > a(t — 1) + bj => a > 0 

3. Now we can find the least integers a and b for which these dependence inequalities will hold. 
In this case, we get a = 1 and b = l 1 . 

4. We can now replace the time equation’s symbolic coefficients by their values. In the case of 
Pascal’s Triangle, this leads to 

= *+J 

All array elements P[t,j] such that t + j = t will be defined at time t. For given £, these 
entries comprise a “hyperplane” , at right angles to the vector (1,1). (see Figure 6) As t is 
increased from 0 to £jvf ox = *Mo* +jMaz 2 > we find a sequence of such hyperplanes which cover 
every point in the array. 

5. At this point, we can write a (poor) schedule: 

lor t G 0 . . . t\f ax do 
forall ij do 

if t = i + j then 

if i = 0 V j = 0 then P[i,j] := 1 

else P[t,j] := P[t\ j - 1] + P[i - 1 J] 

6. A better schedule can be found with more work by folding the time dimension as one dimension 
of a transformed array P*. We will define P ; [ t\j f ] = P[ t,y] and have P ; [/,./] be constructed 
at time i f . Thus, we transform the coordinates t,j for the array into coordinates such 
that i' = t = t -f j. The j* dimension can be anything not parallel to i f ; manipulations are 
easier if we use j itself. We now have a linear transformation from t,j to t \j\ which maps 
integers into integers. Its inverse will also have that property provided that the mapping is 
one-to-one as described in Lamport (1974). Specifically, we construct 

*' = t + j,j' = j , » = »' - j',j = j 1 

PM = nV]^ P'[i + jj 1, P'li'J'} = P[iJ) = P[i' ~ j'J'} 

1 Actually, we are minimizing the absolute values of a and 6. 

3 We will deal later with the problem of finding the range of the transformed time coordinate. 
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7. These equations can now be used to convert the whole problem of evaluating P into one of 
evaluating P 9 . We need only perform a few substitutions and symbolic simplifications. 


— 1, if t — j 9 = 0 V J = 0 by substitution 

= P[i* - j ' - 1,/] + P[i' - 3,3 - 1] otherwise 

= P f [ t — l,j 9 } + - 1 ,j 9 — 1] otherwise by simplification 


8. We can now trivially find a schedule for the transformed array P 9 , using its own coordinate 
V as (sequential) time and the remaining coordinate j f as (parallel) space: 

for i'eO... i' Max do 
forall j 9 do 

i f i 9 - j 9 = 0 V f = 0 then P 9 [ V,j 9 } := 1 

else Py,j 9 ] := - 1,/] + ^[i*' - 1,/ - 1] 

9. P was presumably used within some top-level program T: we define the transformed program 
T 9 as the result of replacing each term P[x, y] in T by P 9 [x + y,y] 3 . Further simplification 
is possible, but the main point has been made: the invertibility of linear transformations, 
and the use of algebraic simplification on transformed expressions, makes it possible to save 
a great deal of computation. 

Generalizing on this example, we can sketch the first version of our method as follows: 

1. Assume that time will be a linear combination of the indices. Write a linear time formula for 
the array, with symbolic coefficients. 

2. Write the dependence inequalities for each recurrence in the array definition. 

3. Solve these inequalities for the coefficients required. (This can be trivial in many cases, or 
can be solved by an integer program otherwise. A variety of methods are applicable (as a 
basic reference, see Taha (1974)); we have preferred to use the simplex algorithm to find an 
approximate floating-point solution, then add constraints to force integer solutions.) 

4. Substitute the coefficient values into the time formula. 

5. Write a preliminary schedule which repeatedly tests each index vector to discover if it is yet 
time to evaluate the corresponding array element. 

* Alternatively, we could redefine P as a function which merely produces the proper entry from P' . 
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6. Transform the coordinates into a space in which time itself is an array index. 

7. Use substitution and simplification to make the transformed array dependent on itself, not 
on the original. 

8. (Trivially) find a schedule for the transformed array. 

9. Use substitution to form the transformed top-level program, which refers only to the trans- 
formed array. 

Clearly, we have no solution if the dependence inequalities were unsolvable. That simply means 
that there was no way of scheduling the array as some linear combination of indices. Equally 
clearly, the preliminary schedule will only be used if the array transformation is unsuccessful; the 
transformation does not change the length of the time axis, but diminishes the number of active 
processors required by an order of magnitude. In a wide range of programs such as the Relaxation 
recurrence above, the transformation will be successful and will not require any ideas beyond those 
presented in the Pascal’s Triangle example. Some will be trivial in that all but one coefficient will be 
zero, so that the process has the effect of identifying time with an existing dimension, perpendicular 
to a hyperplane of computation. Nontrivial combinations of indices (as in the example just given) 
will form hyperplanes which are “diagonal” in the existing array; in such cases, the transformed 
array’s structure will be slightly different from that of the original array, and we might conceivably 
fail in array transformation if the coordinate transformation is not invertible. 

Even when we get a transformed array, it may be larger than the original array; the array P 9 
is twice as large as P. In cases such as Pascal’s Triangle or Relaxation, however, we can use the 
transformed array definition to guide storage reuse. The definition of P did not give us a regular 
way to reuse the storage space for P[i,j], but P 9 [ t\ *] can be thrown away as soon as P'[ i* + 1, *] is 
constructed. The MODEL system reported in Prywes (1983), on which ours is based, implements 
this kind of analysis; it is straightforward, and we will sketch it briefly. 

Suppose that we have the recursively defined transformed array P 1 for which one dimension is 
time. Suppose also that only the final values in P 9 are required, or else that elements of P 9 are 
to be output as they axe produced. Under these assumptions, we can reuse the space occupied by 
old hyperplanes, putting new ones in their places. Specifically, we can use integer programming 
once more to solve for the minimal k such that A: — 1 is an upper bound on the time-length of a 
dependency vector. In the case of Pascal’s Triangle the upper bound is clearly 1, so k ~ 2. Only 
two hyperplanes are computationally necessary at any given moment: k - 1 source hyperplanes and 
a target being produced from them. 

Now, instead of allocating an array of size i 9 Max X we can allocate an array k X f Max and 

use the following schedule. 
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for t' e 0...i' Max do 
forall j* do 

if i f — f = 0 V f ~ 0 then P* [i ' mod k,f] := 1 

else P'[ t f mod k,f] := P'[(t' - 1) mod k,f] + P'[(t' - 1) mod k,j' - 1] 

The storage requirements of the computation have now been reduced by a factor of 
which in many cases (like this one) will be an order of magnitude improvement. 

Our method as described so far primarily reproduces the functionality of Lamport (1974) in a 
purely equational context, with two improvements due primarily to the context. First, rather than 
“parallelizing” an existing sequential solution, we work on a specification without prior program- 
ming effort. Second, our development of a transformed array makes storage reuse straightforward; 
it is not at all clear how to use this idea on the several kinds of dependencies in procedural code. 
However, we have not yet shown any method for solving dependency structures which could not be 
handled by his method. 

Now let us consider the third example from Section 1: the greatest common divisor defini- 
tion. This example was chosen because of its highly irregular dependency vectors. We repeat the 
definition, contracting the first two lines in order to motivate a step in our analysis: 

s[*\i] = « + y» if t = o v j = o 
= s[t - y,j] ify < » 

= g[j — t, i] otherwise 

The irregularity of the dependency vectors is reflected in the structure of the dependence in- 
equalities: 


at + bj > at — aj + 6j, at + bj > aj — at + bi 


which simplify to 

0 > -aj, a(2t - j) > b( % - j) 

By inspection (or by trial and error) we can find that the definition can be scheduled with a = 1, 
b = 1 as before. The same crude schedule or the same transformation cam be applied. However, this 
is a system of two inequations in four unknowns. On the previous example, the t and j subscripts 
cancelled out, and we could solve a set of inequalities in the symbolic coefficients a and 6 by an 
integer program. 

This definition is different for two reasons: 

• The dependencies are not local, or even of fixed length: dependency vectors in one part of the 
array will have sizes different from those of other parts of the array. Therefore the inequalities 
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constraining a and b are parametric inequalities using i and j, which cannot be solved by 
integer programming* From their form, it appears that they should be true for all possible 
values of t and j. However. . . 

• There are different, apparently conflicting dependencies on different conditionals. The in- 
equality 0 > — aj cannot be true throughout the array, since there axe points where j = 0. 
This didn’t prevent a = 1 from being part of a working solution, because 0 > -aj was only 
required at points which had failed the tests t = 0 V j = 0. In fact, we only want a depen- 
dence inequality to be valid for points (t, j) which would cause that dependence to arise. This 
clearly depends on the branching structure of the recursive equations used to define the array. 

Suppose we axe trying to solve gcd(N^M) using the axray which (we know in advance) will 
be of size N x M . The dependence inequality aj > 0 must be satisfied at every point (*,j) for 
which 0 < j < f. In principle, we could write out a • 1 > 0, a • 2 > 0, a • 3 > 0, and so on up to 
a • N > 0. (Evidently, we must know N as a constant.) Similarly, a(2t — j) > b(i - j) must be 
satisfied at every point 0 < t < j. We have a total of roughly ( N X M)f 2 inequalities, which can 
ail be written down explicitly and solved. The process could be mechanized as a partial symbolic 
evaluation of the gcd definition for every *,j point. Remember that the definition is a sequence 
of equations, each in the form item==value, if condition . Let us define the equation selected for a 
particular («,j) pair as the first for which the corresponding condition is true. This equation will 
have one value, with zero or more dependence inequalities involved in the description of that value. 

To generate the set of inequalities, we can follow an algorithm such as: 

lor each t 6 0 . . . N,j £ 0 . . . M do 

let *item=val f if cond 9 be the equation selected for i,j 
lor each dependence inequality in item=val do 

write out that inequality (with constants lor t and j) 

Notice that this only works for conditionals which are independent of the contents of the array; 
they depend upon the indices, i.e. upon where we are in the array. For such conditionals, the 
test can be carried out before the program has any data on which to run because the cond can be 
simplified to Drue or False independently of the array’s contents. 

If the conditionals themselves depend upon the run-time data, then we cannot tell which test 
is satisfied by a given t, j pair. For example, the conditional equation 

A[i] — 1 + 2 * A[i - 1], if A[i - 1] mod 2 = 1 

has a condition which depends on the value stored in A[i— 1]. Since we don’t know this value before 
running the program, we can’t tell whether the condition will be true or false and we can’t be sure 
whether this equation will be selected. In such cases, we have to consider all possible equations 
which might be selected at run-time. To consider all of them, we must redefine the selected for 
concept: instead of a single equation, it must describe a set. Specifically, an equation will be said 
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to be selected for a particular (t,j) pair if its condition does not simplify to False on t,j and no 
previous test simplifies to True. The let statement used above will then become a loop. 

After the algorithm above, we axe left with a (very large) collection of linear inequalities on a 
and 6. We can pass them to an integer programming system, asking for minimal solutions, but 
since the collection may be larger than the dependency graph itself, we should not hope for too 
much, gcd provides the simplest case for this kind of analysis: there is one dependence inequality 
for each (non-boundary) array element, to be computed presumably for the largest possible values 
of N and M . This is not a very useful solution. 

Instead of instantiating each dependence inequality for all possible points at which it might be 
applied, we will have to take some sort of a sample. For each arm of the conditional, we should 
select some collection of points (i.e., of values for t and j) at which its dependencies should be 
satisfied, make linear inequalities by substituting the values into these dependencies, and solve the 
resulting set. The next question, and the basis for the next refinement of the method, is: what 
should we use for sample points? 

3.2 Sample Point Selection 

We will assume that coefficients have been selected for each parameter, and the dependency vectors 
calculated as before. For each branch in the conditional recursive equation we find a constraint- set, 
i.e. a conjunct of linear comparisons (equalities and inequalities) which arise from the branching 
structure itself; this will include the negation of the previous conditions. For the g array we are 
considering, this is awkward because the first line is a disjunction. However, we can split it into 
a pair of conditional equations, and then we are able to express the branching information by 
constraint-sets. 



Constraint-set: 


= i + j, if * = o 

{«• = 0} 

(1) 

= i + j, if j = 0 

O 

tl 

o 

A 

• *4 

(2) 

= s[* -j,j] ^ j < « 

{0 < j,j < »} 

(3) 

= g[j — i, t] otherwise 

{0 < «,j > •} 

(4) 


This constrained equation set is the basis for a consistent solution to the scheduling problem, despite 
the inconsistencies between the dependency vectors under different conditions. The constrained 
equation set explicitly shows information which was implicit in conditions of the original definition. 
The constraint-set for each equation defines a set where that equation may actually be used to 
compute something: outside that set, we do not caxe whether the dependence inequalities of that 
equation are satisfied or not. 

For example, the dependence inequality 0 > —ay derived from equation 3 need only be satisfied 
when 0 < j Aj < *. Again we solve for the least t and j which satisfy each constraint. That gives us 
the base point po = (1,2), which is our first sample point for equation 3. The inequality 0 > - aj 
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from that equation wall be instantiated at po as 0 > —2a. What other points are to be included? 
For this subproblem, it doesn’t matter; if a is an integer satisfying 0 > — 2a then 0 > — 3a will also 
be true. All we will find about a from equation 3 is that a is a strictly positive integer. 

Similarly, the dependence inequality a(2i— j) > b(i—j ) from equation 4 need only be instantiated 
on sample points satisfying 0 < t A j > ». If we find (to, jo) as the least i and j satisfying those, we 
find po = (*o, jo) = (1,1) for this problem. Instantiating the dependence inequality here leads to 
a > 0. What other points are to be included? Our first strategy will create one for each dimension: 
pi will “stretch” po in the t direction, and p? will do the same in the j direction. In other words, 
pi will be the least (t, j) pair satisfying the inequalities that led to po and also satisfying i > *o- 
Similarly, p^ will satisfy j > jo . This simple strategy leads to pi = (2,2) and = (1,2) which give 
us the dependency instantiations 2a > 0 and 0 > -6, respectively. 

Putting all of our inequalities together, we are looking for the least a and b such that 0 > -2a, 
a > 0, 2a > 0 and 0 > -6. There is a lot of redundancy here, but the system is quite finite, and 
has the solution a = 1, b = 1 which we found by inspection at the beginning. 

At this point, we can restate the whole process, in condensed form. For convenience we will 
put the sample-point construction process first. 

• Generate Sample Point Set 

1. Rewrite the conditioned recurrence as a constrained equation set. (If the conditions axe 
expressed as Boolean combinations of linear equalities and inequalities on the indices, 
then this always works; disjunction is handled through case splitting.) 

2. For each such constraint-set, construct the sample point set. (This includes the least 
sample point, and the least extensions to it in each dimension. 

• Solve Dependence Inequalities 

1. Write a linear time-formula for the array evaluation. (This is always trivially possible; 
just assign coefficients to the indices of the array.) 

2. For each recursive dependence, write the dependence inequality. 

3. Instantiate each dependence inequality on each sample point for its equation. 

4. Solve the set of ail instantiations from the previous step to find a time formula for the 
array evaluation. 

5. Invert the formula, construct and schedule the transformed solution as before. 

Clearly this will not always work; there are many recursive definitions in which neither the 
conditions nor the dependencies can be described by conjunctions of linear formulas. 
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m[A;, i] 
G{k,i,j] 


The result is 


— G[fc, «, k\/G\k, fc, k] if t > k 

= 0, otherwise 

= °[*.jl if * = 1 

= G[k- l,t,j] - m[k — 1, »] * G[k — 1 ,k~ 1 ,j] if* > kAj > k 
= G[k - l,t,y] if « < k 

— 0, otherwise 

Gn* 


Figure 6: Mutual Recurrence: Gauss Elimination 


Other sample point selection strategies are possible. For example, rather than increasing in 
each dimension, as demonstrated above, we can instead “strengthen” each constraint inequality 
in each of the constraint sets. The base point would always be the same: in the constraint set 
{0 < j\j < *} f° r equation 3, we would have Pq = (2, 1). The extension points would be generated 
from the constraint sets {0 + 1 < j,j < *} and {0 < j, j + 1 < i}; in each case we simply add 1 to 
the lesser side of the inequation. Somewhat surprisingly, the results axe the same for this problem 
and many others. They axe not equivalent, but they are roughly comparable, and neither is entirely 
satisfactory. (We will explore both incompleteness and inconsistency issues later.) 

Another selection heuristic is to include all the corners of the (rectangular) array being defined, 
with whatever dependence inequalities each is selected for. This was actually our first heuristic; 
it has the very unpleasant property that it always requires knowledge of the actual values of the 
upper bounds for the index sets. 

4 Multiple Hyperplanes 

The previous section demonstrated the method on a single recursively defined array. Essentially 
the same algorithm can be used for arrays whose definitions involve mutual recurrence. Suppose 
there are N equations defining N arrays Ai . . . which are involved in a mutual recurrence. That 
is, 

= E{Aj,Ak) 

where j, k G 1 . . . N , and the expression E may contain references to both the array being defined 
Aj and to other arrays A*. Figure 6 shows an example of such a mutual recurrence in a fragment 
of the specification of Gauss Elimination. At each of the k iterations, m is the vector of multipliers 
for each row of the matrix G. m is defined in terms of G, and G’s definition has a reference to m. 
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for each array. We construct the time equation for each array and derive from it the constrained 
inequality set as before. The only novelty here is that we insert an additive constant, a starting 
offset, into the time formula for each array. This means that the time for an n-dimensional array 
will be expressed as the sum of n terms; it also means that we can express a constant difference 
between the times for corresponding components of different arrays. 

As before, there is one dependence inequality for each textual occurrence of an array involved 
in the recursion. 



a + bk + ci 

a + bk + ci > 

d + ek + f i + gk if * > k 

a + bk + ci > 

d + ek + fk J rgki£i>k 

<(<?[*,*',;]) = 

d + ek + fi + gj 

d + ek + fi ’ + gj > 

d + e(k - 1) + fi + gj if * > k, j > k, k 

d + ek + fi + gj > 

a + b(k — 1) + ci if * > k, j > k, k > 1 

d+ ek + fi + gj > 

d + e(k - 1) + f(k - 1) + gj if * > k,j 


> 1 

> k) k > 1 


Now this set of inequalities may be solved by the method of the previous section. Appropriate 
sample points are chosen for «, j, and fc, yielding a set of inequalities with the coefficients a 
through g as unknowns. The inequalities are solved with an integer program, giving us values for 
the coefficients, and those values are substituted into the original time inequalities £(m[fc,t]) and 
£((?[/:, t,j]) to give a pair of hyperplanes. The solution hyperplanes are as follows. 

t(m[A:,i]) = a + bk + ci , = d + ek + ft + gj 

= 1 + 2 Jfe, = 2k 

Now the problem is to schedule computation of the multiple arrays. A concatenation of schedules 
for each array (in this example an iterative solution for m followed by an iterative solution for G or 
vice versa) is incorrect because individual elements of each array depend on elements of the other. 
Each iteration must include evaluation of elements for every array in the recurrence. A geometric 
interpretation of this condition is that the hyperplanes are interlaced, a hyperplane for G followed 
by a hyperplane for m followed by a hyperplane for G and so forth. Therefore computation of each 
array must be embedded within a common iteration over time. In addition, each array must be 
checked to see whether its hyperplane is currently being evaluated. Only if it is on the hyperplane 
being computed can the array element actually be evaluated. 

These two conditions, a common iteration over time and an extra condition to check, are shown 
in the Gauss schedule. 
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lor t 60... ^Afax do 
lorall /, k do 

il t = 2A: + 1 then 
if * > A: then 

m[fc, i ] = G[fc,t, A]/G[fc, fc, k] 
else m[/c,t] = 0 
elsi 1 t — 2k then 

il t = l then G[fc,t,j] = a[*,j] 
elsi 1 i > k and j > k then 

= <?[fc - - m[k - 1,»] * G[k - 1 ,k- 1, j] 

elsi 1 » < A: then 

=G[k- l,i, i] 
else G[fc,t, j] = 0 

This follows the form of the first (non-transformed) schedule for Pascal’s Triangle. As demon- 
strated earlier, we may attempt to transform the coordinates to make one dimension of the array 
a time dimension. This cannot be done with the Gauss example because the inverse mapping does 
not yield an integer value. 

However, it is quite possible to come up with an algorithmically acceptable version. The multi- 
array hyperplane method will come out with solutions of this form (slightly generalized) for any 
problem in which array computations must be interleaved, and we can expect to see time formulas 
which look like this: 


t[A x [. ..]) = nE 

. .]) = nE + 1 

t{A n [. ..]) = n2?+(n-l) 

In this case, E is the expression (like k in the Gauss example) which defines the true time axis, 
and we can transform the arrays to make time be a dimension of each. (In the Gauss example, 
this is an identity transformation because k is already a dimension of each.) The constant offsets 
added to it require only that the schedule specify a sequence of hyperplane activations for each 
time rather than only one: 

for fceO... k\fax do 
begin 

lorall i, j do 

if k = 1 then G[fc, *,j] :=a[t, j] 

elsif i > k A j > k then G[fc, «, j] :=. . . 
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elsil i <k then G[fc,i, j] :=G[k — 1,i,j] 
else G[fc, 
lorall t do 

it i > k then m[fc, i] :=G[fc,i, k]/G[k y A;, A:] 
else m[fc,i]:-0 

end 

This particular method has neither great generality or great elegance, but it covers a common 
and natural case of mutual recurrence. A great deal can be achieved by such small extensions, 
each of which arises from some example which cannot be properly scheduled without extension. It 
seems reasonable to seek a larger generalization, one which will naturally create not only nested 
blocks but nested tor and f orall loops, but such recursive applications of the methods will have 
to be left for future work. 

5 Failures and Corrections 

5.1 Boundary Adjustments 

The schedules we have written up to now have assumed that “time” would run from 0 to some 
undefined maximum value. This is an oversimplification; the same integer-programming techniques 
which give us our sample points and the schedules themselves can be used to find the starting and 
ending points for the scheduled loop. Let us consider the partition problem with which we first 
showed our notation. 


part[n] - / [n, 1] 

/[n, m] — 0, if n < m 

= 1, if n — m 

= f[n — m, m] + /[n, m + 1] otherwise 


The constraint-sets will be {n < m}, {n = m}, find {n > m}; the first two arms of the conditional 
are non-recursive, but the third requires that we generate sample points (1,0), (2,0), and (2,1). If 
we assume t(/[n, m]) — an + bm we get the dependence inequalities 0 > -am and 0 > 6, so we 
need a solution for a > 0, 2a > 0, and b < 0. 

This is easily solved to produce a = 1, b = —1, so we find that i(n,m) — n — m. Unfortunately, 
this seems to imply a negative time for half the array. It doesn’t matter; we let time range from 
t\fin to It just happens that we have dealt with examples so far in which tMin was 0, but we 

will actually find and t\fax by the same process, which we have skipped over thus far. 

Once more, we solve an integer program involving linear inequalities: in this case we already 
have an expression for time, and we have n and m as index sets (i.e., we know the constants n^, n , 
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riMfli, m\fin and m\fax such that n^fm < n < riM ax and m Min < m < rriMax- To solve for the 
minimal (maximal) time, we solve for the t and j which minimize (maximize) the time expression 
subject to those boundary conditions. In the part example, this gives us tj^in = “ r^Max and 

tMax = n\fax — ™Min- 

Similarly, we can note that the proper base value mjvftn is actually 1; using it throughout the 
method will make trivial changes in intermediate calculation, but none in the final schedule. (The 
final array, of course, will be slightly smaller.) Handling details like this is essential to the creation 
of a working system, but seems to add little to the understanding of the basic problems of equational 
transformation and scheduling. 

5.2 Incompleteness 

The method works successfully on many common examples, but it does fail on examples which axe 
clearly schedulable. Suppose we have X[0 . i . N] and Y [0 . . . N] defined as follows: 

X\i] = 1, if » = 0 

= 2 + 3 * X[x — 1], otherwise ; 

Y[i\ = 0, -if * = 2V 

= + Y[t + 1] otherwise 

Here a “good” schedule will go up X to X[N] and then down Y . We should be able to write this 
schedule without knowing the specific value associated with N ; a time formula such as t(X[t]) = t, 
t(Y [t]) — 1 — t + 2 N would do fine. 

This solution clearly cannot be generated by the method described; the sub term 2 N is not 
a constant that we can handle by integer programming as we propose to use it. Fortunately, 
there is an easy extension which covers this case and other common cases in which boundaries are 
symbolic constants: we treat these as parameters of the problem, and generalize our time formulas 
to £(X[t]) = a + bi + cN and t(Y [»]) = d + e t + fN . The three dependencies are now b > 0, e < 0, 
and d + ei + fN > a + 6t+ cN. Since N is a boundary, we have the inequality t < N from the 
outset. Sample points i = 0, N = 0, » = 1, N = 1 and * = 0, N — 1 expand the third inequality into 
d>a, d + e + f >a + 6 + c, and d + f > a + c. The solution is b = 1, e = — 1, d = 1, a = 0, c = 0, 
/ = 2 as desired. It is important to note that if N were not a boundary, or if we failed to include 
the inequality t < iV, this solution would not work since we would have to include the sample point 
t = 1, N = 0. At that point, the third inequality becomes d + el + /0 > a + 61 + cO which for this 
solution isl—l + 0>0-i-l-f-0. In fact, no hyperplane solution to the problem is possible without 
inclusion of the boundary as a parameter. The same will hold true for many problems in which 
information moves inwards from an arbitrary boundary. 

In general, we cannot claim any completeness properties for the method, but we have found 
that it is easy to extend with extra information (such as symbolic values for the boundaries). 
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5.3 Inconsistency 

Unfortunately, it is possible to find cases in which the method as described so far returns an incorrect 
equation for a hyperplane. Consider the following “doctored up” version of the gcd specification 4 . 


g[hj] = « + j, if * = o V j - 0 
= g[ij — 1] if * < j 

= g [j, i + j] otherwise 

As a recursive function definition, this will terminate for arbitrary (t, j). Is it schedulable? Perhaps: 
we get 

= ai + bj 

at + bj > ai + bj-b so 0 > -b, {» > 0, j > 0, * < j} 

ai + bj > aj + bi + bj so o(» - j) > bi, {» > 0, j > 0,* > j) 

For the dependence inequality 0 > -6, the specific sample points don’t matter as long as there is 
at least one: in this case, the base point (1, 1) is enough. For the second dependence inequality, 
let us pick the sample points from the constraint-set as we have described. The base point for 
{* > 0,j > 0, t > j} is (2,1); extension points are (3,2) and (3,1). This yields the set of inequalities 

6 > 0, a > 26, a > 36, 2a > 36 

These (augmented by the other equation’s requirement that 0 > —6) are satisfied by the point 
(4,1), giving a time formula of 4i + j. Now consider the point (7,6) which should be computed at 
time 34. (7,6) depends on point (6,13) which is computed at time 37! In other words, the point 
(7,6) depends on a point which has not yet been defined at the time (7,6) is scheduled. 

Clearly such a schedule is in error. The point (7,6) was not adequately represented by the 
chosen sample points; the time formula’s correctness on those points did not imply that it would 
work at the point (7,6). We now extend the method to detect and correct erroneous equations for 
hyperplanes. 

Our basic approach in checking hyperplanes is to search for a counter-example (e.g., the point 
(7,6) above) and add it as a sample point. A counter-example in our context is a data point which 
satisfies a constraint-set, but does not satisfy the associated inequality. If there is no such point, 
the time formula is correct; we can go on to the next problem. If there is such a point then we can 
add it as a new sample point (yielding an additional inequality), and re-solve the integer program. 
If this new integer program can be solved, we will get a new time formula which can then be tested 

4 The careful reader will note that for certain values of » and j, reference to g[j,i+j] will index nonexistent elements 
of g. Thia might be treated aa a runtime error, but we find it preferable to allow error values within an array. The 
desired result may still be well-defined. 
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by looking for yet another counter-example.. In the worst case, we will end up enumerating each 
point as a new sample point, but we will never produce an incorrect equation as the final solution 
because we will always find the counterexample before generating a schedule. 

As a first example, we check the original gcd hyperplane for correctness. Recall that our 
constrained linear inequality 

a{2i - j) > 6(t - j) if t > 0, j > i 

had a solution a ~ 1 and 6=1. We now negate the inequality while maintaining the same 
constraint-set {i > 0,j > i}: the negated inequality l(2i - j) < l(t - j) simplifies to t < 0. 

This is not compatible with the constraint-set given, so there is no (i , j) which satifies the 
negated inequality, and so the time formula is correct. 

Next, we apply the correctness test to the modified gcd . 

• We will start with the tentative solution of a = 4 and 6 = 1 as the coefficients of the time 
formula for a point (t, j). 

• Now we attempt to find a counterexample to some dependence inequality from this axm of 
the conditional. Picking the last dependence inequality, we are solving for an (i, j) pair such 
that 4(t — j) < t and i > j. If a suitable point can be found, we have found an (t,j) pair 
whose computation precedes ( or is simultaneous with) that of some pair on which it depends. 

• The least solution to this pair of inequalities is (4,3), whose time of computation is supposed 
to be 19. Note that (4,3) depends on (3,7), whose time of computation is also 19. 

• Since a contradiction has been found, we add (4, 3) to the list of sample points and re-solve. 

In this case, adding the (i, j) = (4, 3) point will not really solve the problem. Adding this point to 
the sample space will give the solution a = 5 and 6=1, for which a contradiction can be found by 
the (t,j) pair (5,4). We will end up enumerating the entire space of («,j). 

If, instead of finding the least (t,j) satisfying the reversed inequality, we look for the greatest 
values of % and j, we can easily find the correct hyperplane on the first iteration. 

Let i,j €= (1,100). Then, solving 

4(« - j) < *, if « > j 

for the greatest (t ,j) gives us (100,99) as a new sample point. When we add this sample point, we 
get the inequalities 

6 > 0, a > 26, a > 36,2a > 36, a > 1006 

Solving this set for the least a and 6 yields a = 101 and 6 = 1, giving us the hyperplane lOlt + j 
and the inequality 

101 (t - j) > t. 
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When we attempt to find a contradiction, 

101 (t — j) < *, if t > j 

there is no solution, so we know that the hyperplane is correct. 

Note that this last refinement of using the greatest (t,j) pair requires that we know the ranges 
of i and j, which was not required for the earlier examples. The approach of treating the size as 
a symbolic parameter no longer works: the hyperplane schedule we’d really like to see would have 
the form *(j[t, j}) = 10 It +j = (IV + l)t + j, m which N and t are multiplied together; we no longer 
have a strictly linear structure unless N is available as a fixed constant. 

However, at this point we are working with pseudo-parallel expressions of sequential algorithms; 
the dependency graph for g really describes a nested-loop structure 

lor i G 0 ... IV do 

lor j G 0...IV do 

i 1 t = 0 V j = 0 V i < j then g [ t , j] := t + j 
else g[ij}:= g[j,i + j] 

Such a schedule goes beyond the scope of this paper, but examination of it does clarify what 
hyperplanes can and cannot do. The hyperplane construction can find schedules for a surprising 
variety of recursive specifications, including many that do not appear linearly structured at first 
glance. However, the schedule generated is “flat”, not recursive, and cannot dead with a lexico- 
graphic (rather than geometric) ordering of dimensions. It remains to be seen whether there are 
natural extensions of the hyperplane method which can make full use of the nested-block structure 
of the programming language in the schedules which it generates. 

6 Conclusions and Future Work 

In this paper we have presented a new method for synthesizing parallel schedules for recursive 
equations. The method can be used in the presence of irregularly structured dependency vectors 
and mutually recursive equations. Although certain recurrences may cause the method to report an 
incorrect hyperplane, this can easily be checked and sometimes corrected. Portions of this have been 
implemented in the PS automatic program generator as reported in Gokhale (1986b), Samet (1984), 
and Torgersen (1986 ) I In general, the methods make frequent use of integer programming, which 
even when restricted to linear systems (as it usually is) tends to be very expensive. In practice, 
however, we find that algebraic simplification can reduce the number of distinct inequalities to the 
extent that there is relatively little work to do. 

Our primary problems now are divided into four areas. 

• First, we hope to develop a sample- point strategy which will prevent the method from re- 
porting incorrect hyperplanes. Since we can detect and correct these already, this is not of 
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any immediate practical importance, but we hope that by doing so we can get some insight 
into the range of usable schedule generations. We. conjecture that this can be achieved by 
considering the convex set of points represented by each constraint-set, and using an “earli- 
est” vertex (a vertex for which the time value is minimal) as base point; the extension points 
should then be found along the edges going out from this point. 

• Secondly, we would like to generalize the scheduling process to create nested blocks of code 
when necessary. A matrix in which the beginning of each row depends on the end of its 
predecessor should turn into a nested loop structure, not a single hyperplane. We can’t deal 
with this at all yet. 

• Third, we would like to deal with dependencies which satisfy an ordering, but do not come 
from simple sums and differences; a gcd definition based on gcd[i,j] — gcd[j mod i, t] is beyond 
our present methods, as is V[i] = V’[L[»]] + V[iZ[i]]. 

• Finally, we would like to incorporate other transformations. For example, it cam be quite 
simple to decompose a regular array structure into blocks for processor allocation; we have 
worked on this as a transformation on the specifications with some success. 
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